Complex information extraction (IE) pipelines, assembled by plumbing together off-the-shelf operators, specially customized operators, and operators reused from other text processing pipelines, are becoming an integral component of most text processing frameworks. A critical task faced by the IE pipeline user is to run a post-mortem analysis on the output. Because extraction operators are diverse and often implemented by independent groups, describing operator semantics formally or operationally to a provenance system is time-consuming and error-prone. We introduce the first system that helps IE users analyze pipeline semantics and infer provenance interactively while debugging. This allows the effort to be proportional to the need and to focus on the portions of the pipeline under the greatest suspicion. We present a generic debugger for running post-execution analysis of any IE pipeline consisting of arbitrary types of operators. We propose an effective provenance model for IE pipelines that captures a variety of operator types, ranging from those for which full specifications are available to those for which none are. We present a suite of algorithms to effectively build provenance and facilitate debugging. Finally, we present an extensive experimental study on large-scale real-world extractions from an index of ~500 million Web documents.